
    Logical Segmentation of Source Code

    Many software analysis methods have come to rely on machine learning approaches. Code segmentation - the process of decomposing source code into meaningful blocks - can augment these methods by featurizing code, reducing noise, and limiting the problem space. Traditionally, code segmentation has been done using syntactic cues; current approaches do not intentionally capture logical content. We develop a novel deep learning approach to generate logical code segments regardless of the language or syntactic correctness of the code. Due to the lack of logically segmented source code, we introduce a unique data set construction technique to approximate ground truth for logically segmented code. Logical code segmentation can improve tasks such as automatically commenting code, detecting software vulnerabilities, repairing bugs, labeling code functionality, and synthesizing new code. Comment: SEKE2019 Conference Full Paper
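
    To make the contrast with syntactic segmentation concrete, the following is a minimal sketch of a purely syntactic baseline that splits code at blank lines. It is an illustrative heuristic only; the function name and the blank-line rule are assumptions, and the paper's deep learning model learns logical boundaries instead.

```python
# Illustrative syntactic baseline only: split source text into segments at
# blank lines. The paper's approach learns *logical* boundaries instead and
# is language-agnostic; this heuristic is just a point of comparison.
def syntactic_segments(source: str) -> list[str]:
    segments, current = [], []
    for line in source.splitlines():
        if line.strip():                 # non-empty line: extend current block
            current.append(line)
        elif current:                    # blank line: close the current block
            segments.append("\n".join(current))
            current = []
    if current:
        segments.append("\n".join(current))
    return segments

example = "import os\n\ndef f(x):\n    return x + 1\n\nprint(f(2))\n"
print(syntactic_segments(example))       # three blocks: import, function, call
```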

    That Escalated Quickly: An ML Framework for Alert Prioritization

    In place of in-house solutions, organizations are increasingly moving towards managed services for cyber defense. Security Operations Centers (SOCs) are specialized cybersecurity units responsible for the defense of an organization, but the large-scale centralization of threat detection is causing SOCs to endure an overwhelming amount of false positive alerts -- a phenomenon known as alert fatigue. Large collections of imprecise sensors, an inability to adapt to known false positives, evolution of the threat landscape, and inefficient use of analyst time all contribute to the alert fatigue problem. To combat these issues, we present That Escalated Quickly (TEQ), a machine learning framework that reduces alert fatigue with minimal changes to SOC workflows by predicting alert-level and incident-level actionability. On real-world data, the system reduces the time it takes to respond to actionable incidents by 22.9%, suppresses 54% of false positives with a 95.1% detection rate, and reduces the number of alerts an analyst needs to investigate within singular incidents by 14%. Comment: Submitted to USENIX Security Symposium
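
    A minimal sketch of the general idea of actionability-based alert triage: score alerts with a classifier and suppress those below a threshold chosen for a target detection rate. The synthetic features, classifier choice, and threshold rule below are assumptions for illustration, not the TEQ implementation.

```python
# Generic sketch of alert prioritization: train a classifier on historical
# alerts labeled actionable/not, then pick a score threshold that keeps a
# target detection (recall) rate while suppressing low-scoring alerts.
# Features, model, and data here are placeholders, not the TEQ system.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 8))                       # stand-in alert features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 1).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# Choose the highest threshold that still recalls >= 95% of actionable alerts.
target_recall = 0.95
pos_scores = np.sort(scores[y_te == 1])
threshold = pos_scores[int(np.floor((1 - target_recall) * len(pos_scores)))]

suppressed = (scores < threshold) & (y_te == 0)
print(f"suppressed false positives: {suppressed.sum()} / {(y_te == 0).sum()}")
```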

    A Language-Agnostic Model for Semantic Source Code Labeling

    Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under the ROC curve of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and we obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents. Comment: MASES 2018 Publication
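
    A minimal sketch of a character-level convolutional network for multi-label tag prediction, in the spirit of the model described above. The layer sizes, sequence length, and vocabulary are assumptions, not the published architecture; only the 4,508-tag output dimension comes from the abstract.

```python
# Minimal character-level CNN for multi-label code tagging (sketch only).
# Layer sizes, sequence length, and vocabulary are illustrative assumptions.
import torch
import torch.nn as nn

class CharCNNTagger(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=16, num_tags=4508):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                 # global max over sequence
        )
        self.out = nn.Linear(128, num_tags)

    def forward(self, char_ids):                     # (batch, seq_len) int64
        x = self.embed(char_ids).transpose(1, 2)     # (batch, embed, seq_len)
        x = self.conv(x).squeeze(-1)                 # (batch, 128)
        return self.out(x)                           # per-tag logits

model = CharCNNTagger()
snippet = torch.randint(0, 128, (2, 512))            # two fake code documents
logits = model(snippet)
probs = torch.sigmoid(logits)                        # independent tag scores
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros_like(logits))
print(probs.shape, loss.item())
```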

    Stan: A Probabilistic Programming Language

    Stan is a probabilistic programming language for specifying statistical models. A Stan program imperatively defines a log probability function over parameters conditioned on specified data and constants. As of version 2.14.0, Stan provides full Bayesian inference for continuous-variable models through Markov chain Monte Carlo methods such as the No-U-Turn sampler, an adaptive form of Hamiltonian Monte Carlo sampling. Penalized maximum likelihood estimates are calculated using optimization methods such as the limited memory Broyden-Fletcher-Goldfarb-Shanno algorithm. Stan is also a platform for computing log densities and their gradients and Hessians, which can be used in alternative algorithms such as variational Bayes, expectation propagation, and marginal inference using approximate integration. To this end, Stan is set up so that the densities, gradients, and Hessians, along with intermediate quantities of the algorithm such as acceptance probabilities, are easily accessible. Stan can be called from the command line using the cmdstan package, through R using the rstan package, and through Python using the pystan package. All three interfaces support sampling and optimization-based inference with diagnostics and posterior analysis. rstan and pystan also provide access to log probabilities, gradients, Hessians, parameter transforms, and specialized plotting.
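
    As noted above, Stan can be called from Python through the pystan package. A minimal usage sketch for a toy normal model follows; it assumes the pystan 2.x interface (StanModel/sampling), which later pystan releases renamed.

```python
# Toy pystan example: full Bayesian inference for a normal mean and scale.
# Uses the pystan 2.x interface; treat the calls as version-dependent.
import pystan

model_code = """
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  y ~ normal(mu, sigma);   // log density accumulated imperatively
}
"""

data = {"N": 5, "y": [1.2, 0.7, 1.9, 1.1, 0.4]}
model = pystan.StanModel(model_code=model_code)       # compiles the model
fit = model.sampling(data=data, iter=2000, chains=4)  # NUTS by default
print(fit)                                            # posterior summaries
```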

    Possible Disintegrating Short-Period Super-Mercury Orbiting KIC 12557548

    We report here on the discovery of stellar occultations, observed with Kepler, that recur periodically at 15.685 hour intervals, but which vary in depth from a maximum of 1.3% to a minimum that can be less than 0.2%. The star that is apparently being occulted is KIC 12557548, a K dwarf with T_eff = 4400 K and V = 16. Because the eclipse depths are highly variable, they cannot be due solely to transits of a single planet with a fixed size. We discuss but dismiss a scenario involving a binary giant planet whose mutual orbit plane precesses, bringing one of the planets into and out of a grazing transit. We also briefly consider an eclipsing binary that either orbits KIC 12557548 in a hierarchical triple configuration or is nearby on the sky, but we find such a scenario inadequate to reproduce the observations. We come down in favor of an explanation that involves macroscopic particles escaping the atmosphere of a slowly disintegrating planet not much larger than Mercury. The particles could take the form of micron-sized pyroxene or aluminum oxide dust grains. The planetary surface is hot enough to sublimate and create a high-Z atmosphere; this atmosphere may be loaded with dust via cloud condensation or explosive volcanism. Atmospheric gas escapes the planet via a Parker-type thermal wind, dragging dust grains with it. We infer a mass loss rate from the observations of order 1 M_E/Gyr, with a dust-to-gas ratio possibly of order unity. For our fiducial 0.1 M_E planet, the evaporation timescale may be ~0.2 Gyr. Smaller mass planets are disfavored because they evaporate still more quickly, as are larger mass planets because they have surface gravities too strong to sustain outflows with the requisite mass-loss rates. The occultation profile evinces an ingress-egress asymmetry that could reflect a comet-like dust tail trailing the planet; we present simulations of such a tail. Comment: 14 pages, 7 figures; submitted to ApJ, January 10, 2012; accepted March 21, 2012
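
    The argument that a single fixed-size planet cannot explain the observations rests on the standard small-planet approximation, in which eclipse depth scales as the square of the occulting-to-stellar radius ratio. A quick check with the quoted depths illustrates the point; this is a back-of-the-envelope illustration, not the paper's dust-tail model.

```python
# Quick check of the fixed-size-planet argument: in the small-planet
# approximation, eclipse depth ~ (R_occ / R_star)^2, so the quoted depth
# range implies an occulting body whose effective radius changes by ~2.5x.
import math

depths = {"max": 0.013, "min": 0.002}                # 1.3% and 0.2% depths
radius_ratio = {k: math.sqrt(d) for k, d in depths.items()}
print(radius_ratio)                                  # ~0.114 vs ~0.045
print(radius_ratio["max"] / radius_ratio["min"])     # ~2.5x change in size
```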

    Systematizing Confidence in Open Research and Evidence (SCORE)

    Assessing the credibility of research claims is a central, continuous, and laborious part of the scientific process. Credibility assessment strategies range from expert judgment to aggregating existing evidence to systematic replication efforts. Such assessments can require substantial time and effort. Research progress could be accelerated if there were rapid, scalable, accurate credibility indicators to guide attention and resource allocation for further assessment. The SCORE program is creating and validating algorithms to provide confidence scores for research claims at scale. To investigate the viability of scalable tools, teams are creating: a database of claims from papers in the social and behavioral sciences; expert and machine generated estimates of credibility; and evidence of reproducibility, robustness, and replicability to validate the estimates. Beyond the primary research objective, the data and artifacts generated from this program will be openly shared and provide an unprecedented opportunity to examine research credibility and evidence.

    A New Approach for Assessment of Mental Architecture: Repeated Tagging

    A new approach is proposed to the study of a relatively neglected property of mental architecture: whether and when the already-processed elements are separated from the to-be-processed elements. The process of numerical proportion discrimination between two sets of elements defined either by color or by orientation can be described as sampling with or without replacement (characterized by binomial or hypergeometric probability distributions, respectively), depending on whether an element can be tagged only once or repeatedly. All empirical psychometric functions were approximated by a theoretical model showing that the ability to keep track of the already-tagged elements is not an inflexible part of the mental architecture but rather an individually variable strategy that also depends on the conspicuity of perceptual attributes. Strong evidence is provided that in a considerable number of trials, observers tagged the same element repeatedly, which can only be done serially at two separate time moments.
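
    The sampling-with-replacement versus sampling-without-replacement contrast corresponds to binomial versus hypergeometric distributions. The short sketch below compares the two probability mass functions; the display size and sample count are illustrative assumptions, not values from the study.

```python
# Compare the two tagging regimes discussed above: tagging elements with
# replacement (binomial) versus without replacement (hypergeometric).
# The display size and sample count are illustrative assumptions.
from scipy.stats import binom, hypergeom

N_total = 20      # total elements in the display
N_red = 12        # elements of the target color
n_tagged = 8      # number of tagging operations / samples

k = range(n_tagged + 1)
with_replacement = binom.pmf(k, n_tagged, N_red / N_total)
without_replacement = hypergeom.pmf(k, N_total, N_red, n_tagged)

for ki, p_b, p_h in zip(k, with_replacement, without_replacement):
    print(f"k={ki}: binomial={p_b:.3f}  hypergeometric={p_h:.3f}")
# The hypergeometric PMF is narrower: keeping track of already-tagged
# elements reduces the variance of the estimated proportion.
```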